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Abstract —In this paper, we present a robotic model-based 
reinforcement learning method that combines ideas from model 
identification and model predictive control. We use a feature- 
based representation of the dynamics that allows the dynamics 
model to be fitted with a simple least squares procedure, 
and the features are identified from a high-level specification 
of the robot’s morphology, consisting of the number and 
connectivity structure of its links. Model predictive control is 
then used to choose the actions under an optimistic model of 
the dynamics, which produces an efficient and goal-directed 
exploration strategy. We present real time experimental results 
on standard benchmark problems involving the pendulum, 
cartpole, and double pendulum systems. Experiments indicate 
that our method is able to learn a range of benchmark 
tasks substantially faster than the previous best methods. To 
evaluate our approach on a realistic robotic control task, we 
also demonstrate real time control of a simulated 7 degree of 
freedom arm. 

I. Introduction 

Model-based control of robotic systems requires model 
identification to be performed before the system can be 
effectively controlled, particularly for dynamic, high-speed 
motions. One way to tackle this challenge is to first per¬ 
form model identification, and then use the identified model 
to control the system [16], [20]. However, this approach 
requires a dedicated model identification step, which can 
become inefficient if the dynamics change frequently or 
suddenly, or if the robot interacts with unfamiliar physical 
objects that must each be identified. 

Reinforcement learning (RL) offers a framework for auto¬ 
matically trading off exploration and exploitation to complete 
the task as quickly as possible. Model-based RL reduces the 
required system interaction time by learning a model of the 
dynamics, while still trading off exploration and exploitation 
to learn a model that is just detailed enough to succeed at 
the task. General-purpose statistical models are often used 
to represent the dynamics, but such models can require 
a substantial number of samples to acquire a sufficiently 
accurate dynamics estimate [13], [21]. 

In this paper, we combine concepts from model identifi¬ 
cation and model-based reinforcement learning to complete 
the task as quickly as possible while identifying the sys¬ 
tem online to acquire a model that is sufficient for task 
completion. In contrast to prior methods that use generic 
statistical models of the dynamics, we use a feature-based 
least-squares formulation of the model identification prob¬ 
lem, which allows the model to be identified extremely 
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Fig. 1. 7 DoF arm learning to reach a tai'get pose using our method. Later 

time steps shown with higher opacity, with the target pose fully opaque. 

quickly by bringing in our prior knowledge about the robot’s 
morphology and physics. Exploration is performed using an 
optimistic model-predictive control (MFC) framework [21], 
which determines the optimal trajectory under an optimistic 
formulation of the system dynamics. As more interaction 
samples are gathered, the amount of optimism is reduced, 
until the method converges to the true dynamics. 

Our main contribution is a method for combining 
optimism-driven exploration with simple least-squares model 
fitting based on physical features of the dynamics that can be 
extracted automatically from a high-level specification of the 
system morphology (e.g., the number and connectivity of the 
links, which is readily available for most robotic systems). 
These features can be obtained manually for simple systems, 
or can be computed automatically using existing packages 
such as SymPyBotics [26]. We first present experimental 
results on the pendulum, double pendulum, and cartpole 
benchmarks. Compared to prior methods, our approach is 
able to solve these tasks using the lowest amount of system 
interaction time. In comparison to a prior method based on 
optimism-driven exploration [21], our approach also requires 
much less computation time, making it suitable for real¬ 
time online learning. To evaluate our method on a realistic 
robotic control task, we also demonstrate real time control 
of a simulated 7 degree of freedom arm, shown in Figure 1. 

II. Related Work 

System and model identification has been explored exten¬ 
sively in the context of robotics [16], [20]. Several methods 
have been proposed in the literature for finding good ex¬ 
citation trajectories for model identification [5], [14], [25], 
[28], [31], [32]. The physical feasibility of parameters during 
the identification process has also been considered [15], 


[26]. However, model identification is typically performed 
as a separate process from control. In contrast, our work 
addresses the problem of controlling a robotic system to 
complete a task as quickly as possible, while learning a 
model sufficient for task completion. 

One alternative to offline model identification is provided 
by adaptive control [6]. Adaptive control offers compelling 
convergence and stability guarantees, but is typically con¬ 
cerned with stabilizing around a target trajectory under 
a linear model. This makes it difficult to apply to more 
complex, nonlinear robotic systems with high-level goals 
defined by arbitrary cost functions. 

Reinforcement learning (RL) [18], [27] tackles control 
problems with nonlinear dynamics in a more general frame¬ 
work, which can be either model-based or model-free. 
Model-based RL reduces the required interaction time by 
learning a model of the system during execution, and opti¬ 
mizing the control policy under this model, either offline 
in an episodic setting, or online. In the context of RL, 
exploration refers to intentionally taking suboptimal actions 
to improve future performance. Although many methods 
have been proposed for model-based RL with efficient explo¬ 
ration in discrete MDPs, data-efficient model-based RL for 
continuous systems remains a challenging problem despite 
substantial recent advances [1], [13], [19], [21]. 

Although several very successful model-based RL meth¬ 
ods have been proposed recently [8], [13], [21], such meth¬ 
ods typically use general-purpose statistical models of the 
dynamics. Such models are very flexible, but require a 
substantial number of samples to learn models that are 
accurate enough to succeed at the task. Several prior methods 
have suggested incorporating knowledge about the dynamics 
as a prior on the dynamics model [11], [12], [23]. However, 
these methods typically assume that prior knowledge comes 
in the form of predictions about the next state, which is a 
very specific and quantitative type of prior. Our approach 
incorporates prior knowledge about the dynamics of rigid- 
body systems, but does not assume knowledge of system 
parameters, such as masses and link lengths. Without these 
parameters, this prior model cannot give reasonable predic¬ 
tions. However, it tells us a great deal about the structure of 
the dynamics. The equations of motion can be written such 
that the parameters and model features decompose linearly, 
providing for a very efficient learning algorithm. 

In order for the algorithm to learn the model at the 
same time as it performs the task, we use an optimism- 
driven exploration strategy combined with model-predictive 
control to continuously replan the next action. Prior work 
has proposed to use optimism-driven exploration mainly for 
discrete systems [4], [19]. A recent extension to continuous 
nonlinear systems provides for sample-efficient learning [21], 
but does not run in real time, making it impractical for real 
applications. We use an efficient MPC method based on 
differential dynamic programming (DDP) [17], which allows 
us to achieve real-time performance even while continuously 
refitting the model parameters with each new sample. 


Algorithm 1 Model-based RL with MPC and optimistic 
exploration 

Require: Start state Xstart = [q,q]start, cost function /(x,t), 
sampling frequency v^, control frequency 
1: z <— Random controls 
2: o Empty list of observations 

3: repeat 

4: Execute T for 1/Vc seconds 

5: Append current [q,q,q,T] to o every l/v* seconds 

6: Estimate dynamics /(q,q,T) using samples in o 

7: Construct optimistic dynamics /(q,q,'C,^) =q 

8: Optimize a trajectory from [q,q] using /(q,q,T,4) 

with virtual control penalty ;^||^|l 2 
9: Update T to be the first action along this trajectory 

10: until task completion 


III. Optimistic Exploration with Continuous 
Model Identification 

Our method combines feature-based model identification 
with optimistic exploration in an online model-based rein¬ 
forcement learning algorithm. An outline of the method is 
presented in Algorithm 1. Here, q denotes the configuration 
of the robot, x = [q,q] is the state of the system, and T is the 
commanded action (e.g. the joint torques and forces). The 
task is specified by a cost function 1(x,t), which typically 
depends on the distance between the current state x and some 
target state Xgoai. We begin with a random initial action and 
empty list of observations o. We then repeatedly execute 
the current action T for 1 /Vc seconds, while collecting new 
observations [q,q,q,T] at a frequency of Vj. These observa¬ 
tions are appended to o. This list of observations is used to 
estimate the system dynamics /(q,q,T) =q, as described in 
Section III-A. These estimated dynamics are then converted 
into optimistic dynamics /(q,q,T,^) = q by including a set 
of virtual controls ^ to allow the MPC algorithm to take 
optimistic action, as described in Section III-B. Using these 
estimated dynamics, the method then plans a fixed-horizon 
trajectory that minimizes the cost 1(x,t) from the current 
state X by using differential dynamic programming (DDP), 
as described in Section III-C. The next action T is then 
set to the first action along this trajectory, following the 
model predictive control (MPC) paradigm [10]. This process 
is repeated until a specified goal condition is reached. 

A. Model Identification via Least Squares 

In this section, we describe how the approximate dynamics 
/(q,q,T) are fitted to samples o = {[q,q,q, T]i}. This method 
assumes that we know the morphology of the robot (the 
number and connectivity of its links), and therefore can write 
down its equations of motion. However, we do not necessar¬ 
ily know the physical parameters, such as the masses and 
lengths of the links. This assumption is reasonable for many 
physical systems, since the morphology and connectivity of 
the links can easily be ascertained from observation, but the 
physical parameters require a complex system identification 






procedure. For a robot consisting of a system of articulated 
rigid bodies, the equations of motion can be decomposed 
such that model identification can be formulated as linear 
regression, making the dynamics fitting simple and efficient. 
While this technique is known in the model identification 
literature [16], we present it here for completeness. 

Using M(q) to denote the mass matrix, C(q, q) to represent 
the Coriolis and centripetal forces, g(q) to represent gravity, 
and T the forces and torques, the equations of motion are 
given by 

M(q)q + C(q,q)+g(q) = T. (1) 

These dynamics equations can be written as: 

iy(q,q,q)-A = T, 

where the vector A depends only on system parameters and 
the matrix //(q,q,q), also referred to as the regressor matrix, 
does not depend on the system parameters. Model identifi¬ 
cation can then performed by estimating A. For instance, the 
vector A = ... <5^] ^ for /T-link manipulators consists 

of the inertial parameters for each link. For under-actuated 
systems, the dynamics can be expressed in the same form. 
A key difference is that we have zeroes in the x vector 
corresponding to the unactuated degrees of freedom. We 
describe how this decomposition is performed for specific 
robotic systems in Section IV. 

We assume that we have noisy observations of the features 
[q,q,q]. Given an observation vector z of the features [q,q,q] 
and generalized joint forces/torques r for N samples, the 
vector A may be inferred by least squares regression as: 

A = argmin||AA —b||^ 

A 

where 



■i/(qi,qi,qi)' 


n 

A = 


, b = 



.^(qA'iqATjqv). 


JN. 


Since //(q,q,q) is typically not full rank, A is not full rank 
and the solution is an affine subspace. Thus, we use the 
Moore-Penrose pseudo-inverse to get our solution A=A^b. 
This gives us the least norm solution in the affine subspace. 
Once we estimate A, we can recover the forward dynamics 
equation /(q,q,T) = q by solving the equations of motion 
in ( 1 ) with respect to q. This forward dynamics estimate can 
then be used with any model predictive control method to 
choose locally optimal actions. However, acting greedily with 
respect to this dynamics estimate is not always desirable. 
When the dynamics are incorrect, it may be preferable to 
instead take actions that are suboptimal under the estimated 
model, but that are more effective at exploring the state 
space of the system, in order to acquire a better estimate 
of the dynamics that can allow the method to more quickly 
reach the goal. In the next section, we describe one particular 
exploration method that involves constructing an optimistic 
estimate of the dynamics. 


B. Optimistic Exploration 

Our method uses model predictive control (MPC) to 
choose the actions. In order to perform the task quickly 
while identifying the model to a sufficient degree for task 
completion, we augment MPC with optimistic exploration. 
This combination of MPC with exploration allows for the 
exploration strategy to change online as the model is updated. 
The intuition behind this exploration strategy is that, when 
the dynamics are uncertain, the algorithm is allowed to 
choose which of the dynamics models it prefers, among those 
models that are highly probable given the data. If the algo¬ 
rithm chooses an accurate dynamics model, it will complete 
the task. If it chooses an inaccurate model, it will receive 
observations that show that this model is inaccurate, and the 
dynamics estimate will be improved. In prior work, this type 
of exploration strategy was shown to substantially improve 
the sample-efficiency of model-based RL [21]. However, this 
prior method suffered from very long computation times, 
which made it impractical for real-time online control. In 
this section, we present a simplified variant of the optimistic 
exploration framework suitable for real-time applications. In 
the next section, we show how it can be incorporated into 
a simple and efficient DDP algorithm to allow for efficient, 
real-time control. 

In order to allow MPC to choose among the likely 
dynamics models, we introduce slack variables into the 
dynamics, such that q, =/(q^qr,'^'t) + = f(qt,qt,'ft,6)> 

where / is the new optimistic dynamics model. When these 
slack variables are treated as virtual controls by MPC, they 
enable optimistic exploration. Intuitively, they account for 
uncertainty about the dynamics due to imprecise estimates of 
the vector A. To keep MPC from choosing highly improbable 
dynamics, the slacks are penalized quadratically during MPC 
with a penalty of the form The magnitude of 

exploration is controlled by m, which should be proportional 
to the amount of uncertainty about the current dynamics.' 
Previous work used Bayesian models to accurately estimate 
this uncertainty [21]. In this work, we simply decrease m 
as the number of samples N increases. While this approach 
is somewhat simplistic, we found that it works well in 
practice. Establishing a formal bound on m in terms of the 
number of samples N is difficult due to the complexity of 
the physical model. However, we can roughly estimate this 
bound by considering a simplified linear-Gaussian model of 
the dynamics. Given a multivariate Gaussian with mean fio 
and covariance Eq. the variance of the posterior estimate of 
the mean after the update is given by (Eq * -f AtE^*)^', where 
E is the sample variance [22]. This suggests that, for large N, 
the variance of the mean decreases roughly as 1/At with the 
number of samples N. For simplicity, we use only a single 
exploration hyper-parameter c, using ni = as an estimate 
of the uncertainty about the model. This makes it easier 
to adjust the amount of exploration by tweaking a single 

*Note that the uncertainty about the model is not the same as dynamics 
noise. In this work, we assume deterministic dynamics, though stochastic 
dynamics could also be handled in this framework. 






parameter. 

This optimistic exploration scheme has the effect that the 
system is steered into taking actions that either move it 
toward the goal, or else update the model if the previously 
chosen path to the goal is incorrect, so that another route is 
attempted on the next replanning step. In the case of linear 
dynamics or discrete systems, this optimistic exploration 
scheme has a number of desirable theoretical properties that 
make it a good choice [1], [3], [4], [7]. Although such results 
do not exist for the general continuous nonlinear case, we 
observed that the optimistic exploration strategy empirically 
achieves effective exploration in practice. 

C. Model Predictive Control 

To achieve real-time control for online reinforcement 
learning, we use a simple and efficient differential dynamic 
programming (DDP) algorithm to choose locally optimal ac¬ 
tions u = which include both real and virtual controls, 
with respect to the optimistic dynamics /(qr,qr, = q- 
The actions are optimized with respect to an augmented 
cost function of the form /(x,u) =/(x, t)- f ^ ||^r|p, which 
includes both the actual cost of the task and the penalty on 
the virtual controls. We first convert the optimistic forward 
dynamics into a discrete dynamics equation of the form 
x,+i =/(x,,Uf) by using a fourth order Runga-Kutta inte¬ 
grator, and then supply these dynamics and cost function to 
a DDP algorithm, which we summarize in this section for 
completeness. Once this algorithm determines a sequence of 
locally optimal actions, we extract T from the first action and 
apply this control to the system. 

The optimal control problem we aim to solve can be 
formulated as 

r-i 

min y^f(x,,u, ) (2a) 

xrr,10:7-1 

f=0 

subject to: x, =/(x,_i,Ur_i), Vf € 1,... ,7 (2b) 

The goal is to find the set of controls Uo:r-i that minimizes 
the cost function starting from the current state xq. We 
use a variant of DDP called iterative LQR (iLQR), which 
requires only a first order expansion of the dynamics [29]. 
This method is particularly fast, making it well suited for 
MPC. The rest of this section summarizes this method. 
The algorithm iteratively computes first order expansions 
of the dynamics and second order expansions of the cost 
around the current trajectory, and then analytically com¬ 
putes the sequence of optimal controls with respect to this 
approximation. This sequence of controls is then executed 
to obtain a new trajectory, and the process repeats until 
convergence or for a fixed number of iterations. The controls 
are computed by a dynamic programming procedure that 
consists of recursively updating the value function and Q- 
function, defined as 

T-l 

V,(x,) = min y^I{xi,Ui) 

Ur:r-1 . —^ 
i=t 

Q{x,,u,) = !{x,,u,) +V,+i{f{x,,Ut)). 


Under the LQR assumptions, both of these functions are 
quadratic, and can be expressed up to a constant as 

Vt(x,) = 


X, 

T 

Qxj 

Qxu,t 


Xr 

-f 

Xr 

T 

Qx,t 



Qux,t 

Qmijt 



.“l 


_Qu,t 


Let lx,t,lu,t,lxx,t,lux,t,lmi,t denote the first and second deriva¬ 
tives of the cost function ^(x,,u,), and fx,t,fii.t denote the 
derivatives of the discretized dynamics. The coefficients can 
be written as a recurrence described by 

Qx,t = ix,t + fl,t^x,t+l 

Qu.t — lu,t fu,t^x,t+l 

Qxx,t — 

Qliu^T - ^uu,t + fu,t^xx,t+lfu,t 

Q,UX,t — lux,t + fu,t^xx,t+lfx,t 

Vx,t = Qx,t ~ Ql,x,tQuu.tQu,t 

Vxx,t = Qxx,t ~ Qjix,tQuu,tQux,t■ 

With this recurrence, we can obtain the optimal policy 
g{x,) = u, -fk, -1-K,(x, -X,), where k, = -QultQu^t is the 
open loop term and K, = —Quu.tQux,t is the closed loop feed¬ 
back gain term. Because we are using an MPC framework, 
we only execute a small portion of the converged optimal 
control policy. This makes it very convenient to use the 
previous found solution as a warm-start, which allows for 
fast convergence. 

One final detail in this framework is that, in the early 
stages of learning, the estimate of the model parameters A 
may be too inaccurate to perform stable forward and back¬ 
ward passes with MPC. If we detect that the forward pass 
diverges, we revert to a simple double-integrator dynamics 
model. Typically, this stage of learning lasts less than one 
second. 

IV. Experiments 

We evaluated our method on a number of standard robotic 
control benchmarks: the pendulum, cartpole, and double 
pendulum, as shown in Figure 3, as well as on a 7 degree 
of freedom arm, shown in Figure 5. For each system, our 
method obtains a noisy observation of the features [q,q,q]^, 
where the noise is additive and drawn from a zero-mean 
Gaussian, and is tasked with reaching a target state as quickly 
as possible. Our implementation was in Python and C-H- and 
ran with on a single 3.2 Ghz Intel processor. 

A. Benchmark Tasks 

The pendulum, cartpole, and double pendulum bench¬ 
marks require controlling an underpowered system to swing 
up and place the endpoint of the last link at the target 
position. Control limits prevent each system from swinging 
up by continuous application of the same torque, requiring 
long-horizon planning. The cost function for our method 
consists of the distance between the endpoint of the last link 
and the target, as well as terms to penalize large velocities 













pendulum 

cartpole 

double pendulum 

DDP with known dynamics 

3.04 ± 0.89s 

7.44 ± 3.26s 

3.7 ± 0.89s 

our method 

3.28 ± 1.17s 

8.31 ± 3.15s 

4.98 ± 1.83s 

optimism-driven exploration [21] 

3.9 ± Is 

10 ± 3s 

17 ± 7s 

Boedecker et al. [9] 

— 

12-18S 

— 

PILCO [13] 

12s 

17.5s 

50s 


Fig. 2. Interaction time required to successfully learn each benchmark task for our method, DDP with known dynamics, and the best prior methods. 



Fig. 3. Benchmark tasks: Pendulum (left), cartpole (center), double 
pendulum (right) 

and controls. To impose control limits, we pass the controls 
from DDP through a squashing function of the form s{u) — 
2c{g{u) —0.5), where (j(.) is the logistic function and c is 
the control limit. The cost function therefore has the form 

T 

l(x„u,)= ^ \J (p(x,) -p*)T2p(p(x,) -p*) + a 

t=0 

+ ^ [x72,,Xf + i(u,)T/;i(u,)+u/Pu,] , 

where the hrst term is a Huber-like loss on the distance 
to the target endpoint position p*, Qp is a diagonal weight 
matrix, and a is a smoothing constant. The velocity cost is 
weighted by a diagonal weight matrix Qy, and the controls 
are penalized both after squashing under R and before the 
squashing, under P, as recommended in prior work [30]. 
Success at each task required reducing the distance between 
p(x,) and p* to less than 0.05 units. We ran 50 trials for 
each benchmark system. 

The regressor matrix for the cartpole and the double 
pendulum systems was obtained by manually factoring the 
equations of motion into //(q,q,q)A = T, while the regressor 
matrix for the pendulum was obtained automatically using 
SymPyBotics [26]. Further details about each system are 
presented in Appendix A. 

B. Benchmark Comparisons 

The results for the pendulum, cartpole, and double pen¬ 
dulum tasks are shown in Figure 2. The most sample- 
efficient previous results on these tasks were obtained using 
optimism-driven exploration with a Dirichlet process mixture 
model [21]. However, the computational requirements of 
this approach prevented it from running in real time, with 
most tasks running at less than one hundredth of real time. 
Our proposed method is able to complete each task in 
real time, by using DDP-based model predictive control 
and a dynamics model that can be refitted efficiently using 


least squares. Other state-of-the-art prior methods shown in 
our results table include PILCO, which uses an episodic 
formulation instead of learning online and therefore runs 
comfortably in real time [13], as well as Boedecker et al. [9], 
which uses Gaussian processes with MPC. We also include 
the time to completion for DDP using the true dynamics for 
each task, to provide a lower bound on the possible time to 
completion. 

Our method achieves the best sample efficiency on each 
of the benchmark tasks. In fact, the time to completion on 
each task is very close to the time attained by DDP with 
known dynamics, indicating that our approach is able to 
identify a sufficiently accurate model of the system extremely 
rapidly. The advantage of our approach increases with system 
complexity, with the more complex double pendulum task 
attaining a time to completion that more than three times 
faster than the previous best approach [21], and ten times 
faster than the previous best approach that can run in real 
time [13]. Furthermore, unlike the previous optimism-driven 
method [21], the computational cost of our approach is 
well within the bounds required for real-time operation. The 
average wallclock computation time for each benchmark is 
shown below: 

pendulum cartpole double pendulum 

2.67 ± 1.06s 6.70 ± 4.49s 3.98 ± 1.66s 

C. 7 Degree of Freedom Ann 

Since the dynamics features for our method can be con¬ 
structed automatically, we can extend it to more complex 
tasks that are representative of real-world robotic control 
problems. To evaluate this capability, we tested our method 
on a simulated Barrett WAM 7 degree of freedom arm. We 
obtained the dynamics features by using the SymPyBotics 
package [26]. The goal of the task was to reach a target 
pose with zero velocity, starting with no prior knowledge 
about the physical parameters of the system, other than the 
dynamics features. Ten target poses were selected at random 
from a spherical Gaussian distribution with a covariance of 
1, centered in the middle of the joint limits. A trial was 
considered complete when the Loo distance to the target pose 
was less than 0.05, and the velocity L„o norm was less than 
0.1. The cost function for this task had the form 

/(x,,Uf) = i [{xt-x*yQ{xt-x*)+s{utyRs{ut)+uJPut] , 

where Q was set to be 20/ for the velocities 
and 50000/ for the joint angles. We chose 
R = diag(0.08,0.00004,0.12,0.04,0.04,0.04,0.04), to 
account for the fact that the bigger shoulder pan 











target pose: 


1 


2 


3 



4 



5 


DDP with known dynamics 

1.43 

± 

0.03s 

1.64 ± 0.02s 

1.34 

± 

0.02s 

2.68 

± 

0.84s 

1.57 

± 

0.03s 

our method 

5.84 

± 

2.76s 

9.11 ± 3.4s 

10.9 

± 

4.62s 

9.14 

± 

6.22s 

3.61 

± 

1.12s 

target pose: 


6 


7 


8 



9 



10 

DDP with known dynamics 

2.05 

± 

0.0s 

0.35 ± 0.09s 

1.9 

± 

0.0s 

2.65 

± 

0.0s 

4.98 

± 

3.32s 

our method 

6.15 

± 

2.64s 

4.6 ± 2.35s 

3.71 

± 

1.34s 

7.77 

± 

2.36s 

9.99 

± 

4.49s 


Fig. 4. Results for ten randomly chosen target poses for 7 DoF arm for DDP with the true dynamics and our method, which learned the dynamics online 
from system interaction. 



Fig. 5. Sample trajectories for the 7 DoF Barrett arm. The target pose is opaque, and the preceding poses become progressively more translucent. Each 
image shows an entire trajectory executed by our method, with poses sampled at 0.25 second intervals. 


joint needed to apply larger torques to raise the 
arm, and set P = /?/100. We used torque limits of 
[±77.3, ±160.6, ±95.6, ±29.4, ± 11.6, ± 11.6, ±2.7], and 

Vc = 20 Hz, Vj = 100 Hz. The observation noise was set to 
have a standard deviation of 0.014 by analyzing the encoder 
accuracy of the Barrett WAM.^ 

The results of these experiments are shown in in Figure 4. 
For each target pose, we ran 10 trials. The results indicate 
that our method was successfully able to move the arm into 
the desired pose in each experiment. Although some of the 
target poses required more time to acquire a sufficiently 
accurate dynamics model, some of the targets were reached 
in time that was only 2-3 times slower than DDP with known 
dynamics. In all cases, the computation time required to find 
a solution was comparable to the interaction time, indicating 
that our method could run in real time. 

Note that the dynamics model of this three-dimensional 7 
degree of freedom arm is much more complex than any of 
the benchmarks in the previous section, and the weights A 
had a dimensionality of 70, compared to less than 10 for the 
benchmark tasks. Several sample trajectories obtained using 
our method are shown in Figure 5. 

V. Discussion and Future Work 

We presented a model-based reinforcement learning 
method that combines ideas from model identification, opti¬ 
mistic exploration, and model predictive control to quickly 
and efficiently learn to perform robotic control tasks under 
unknown dynamics. The key idea in our work is to combine 
efficient linear models of the dynamics, which are informed 
by domain knowledge of articulated physical systems, with 
optimism-driven exploration. The features for these linear 
models can be obtained automatically from the morphology 

^As with most encoders, position readings are substantially more accurate 
than velocity readings. For simplicity, we set the observation noise to 
correspond to the less accurate velocity readings for all entries of the state. 


of the robot, and the optimism-driven exploration can be 
performed using model predictive control. 

Our method achieves state-of-the-art sample efficiency 
on standard benchmark problems, including the pendulum, 
the double pendulum, and the cartpole tasks. Furthermore, 
unlike our previous MFC-based exploration method, which 
used a statistical model of the dynamics [21], our method 
achieves real-time performance, making it feasible for online 
reinforcement learning on real robotic platforms. Unlike the 
prior methods in our comparison, our approach leverages 
additional domain knowledge to greatly simplify the model 
learning problem. This prior knowledge is encapsulated in 
the dynamics features, which can be linearly combined to 
obtain the true dynamics. While this makes the comparison 
somewhat unequal, it serves to illustrate an important point 
in model-based reinforcement learning for robotic control: 
using freely available prior knowledge about the physical 
system can dramatically simplify the model learning and 
control problem. The prior knowledge we use is trivial to 
obtain for most robotic systems, since it consists of the 
number and connectivity of the robot’s links, information 
that can be easily gathered from a cursory examination. 

A number of future directions should be explored to make 
such applications effective and practical. First, although we 
demonstrate state-of-the-art results in simulation and evaluate 
some simulated unmodeled effects, we did not evaluate the 
robustness of our approach to unmodeled effects on a real 
system. The least-squares model identification procedure we 
use has been applied to real robotic systems in the past [16], 
so there is reason to believe that robustness to unmodeled 
effects may already be adequate. However, an interesting 
avenue for future work would be to combine our linear 
models with more expressive statistical models that can 
account for unmodeled effects, similar to prior work on 
autonomous vehicle control [2], [24]. 

We demonstrated our method on a variety of articulated 





systems, but the approach is general enough to apply to other 
kinds of robotic systems also, including autonomous vehicles 
and aircraft. In fact, our prior work already demonstrated that 
optimism-driven exploration can achieve impressive results 
on simulated helicopter control, recovering from an engine 
failure with auto-rotation [21]. Extension of our proposed 
method to such applications requires only a method for 
constructing the corresponding dynamics features, which can 
be obtained from analyzing the equations of motion of the 
system. Another promising application of our approach is 
in robotic manipulation of diverse sets of large, complex 
objects, where performing dedicated model identification as 
a discrete step is impractical. For example, if we imagine a 
construction robot that must handle a variety of previously 
unseen objects in the course of a typical manipulation 
scenario, the ability to very quickly acquire an accurate 
model for model-based planning can substantially improve 
the speed, efficiency, and robustness of the robot’s behavior. 
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Appendix 

A. Benchmark Details 

In this appendix, we present details for each benchmark 
system in our evaluation. 

1) Pendulum: The pendulum system is a simple single 
link system with these nonlinear dynamics [13]: 

q^d, 

u — b9— Imlg sind 

^ = - - 177^7 

where 9 is the joint angle of the link, u is the torque applied 
at this joint, m and / are the mass and length of the link, 
respectively, g is the gravity constant 9.81(m/s^), b is the 
friction coefficient, and I is the moment of inertia around the 



pendulum midpoint, which is equal to The parameters 

are m = 1 kg, 1 = 1 m, and constrained u G [—3,3] N-m. The 
goal state is [O tt] which has the pendulum standing up 
with no velocity. 

For the cost function, we used a = 0.01, Qp = 21, Qy = 
diag([0.005,0]), P = 0.017, where 7 is the identity matrix. 
Recall that 7? is a diagonal matrix that penalizes the squashed 
control and the virtual control, as mentioned before. For 
the squashed controls, the upper left block of R, which 
is the matrix of the quadratic penalty, is 0.017. We chose 
r = 13,5 = 0.1s,Vc = 10 Hz,v, = 100 Hz. 

We can rewrite the dynamics in the form given in Eqn. 2: 


77(q,q,q) = [0 0 sin0] 


A = 


b 

\mgl 


2) Cartpole: The cart-pole system is a nonlinear system 
described by the following dynamics [13]: 


q = [0 xY 


q = 


— 3m/0^ sin 0 cos 6—6{M+m)g sin 0 — 6{u—bx) cos 0 
4/(M+m)—3m/cos^ 0 
2m/0^ sin 0+3m2^ sin 0 cos 0 +4 m— 40x 
4(M+m)—3m/cos^ 0 


The state space is 4D and the control is ID, which is the 
external force applied to the cart. M denotes the mass of the 
cart, OT 2 denotes the mass of the pole, I denotes the length 
of the pole, 0 denotes the angle of the pendulum, p denotes 
the position of the cart, b denotes the friction between the 
cart and the ground, and g — 9.8 (m/s^) is acceleration due 
to gravity. We chose M = .5 kg, m 2 = .5 kg, / = .5 m, 5 = . 1 
N/m/s, and constrainted uG [—10,10] N-m. The goal state 
is [0 0 n 0 ] which has the cartpole standing up at the 
origin with no velocity. 

For the cost function, we used a = 0.1, Qp = 
diag([l,20]), ev = diag([0.07, 0.03, 0, 3]), P = 0.017. The 
upper left block of R, for the squashed controls, is 0.017. We 
chose 7’ = 8,5=0.1s,Vc = 16.7 Hz,v, = 50 Hz. 

Since the cartpole is underactuated, we moved a term to 
the right hand side of the dynamics to replace the zero due 
to the unactuacted degree of freedom. We can rewrite the 
dynamics in the form given in Eqn. 2: 


joints. The system dynamics are: 


q=[0i O 2 Y 

ljilm+m2)+h jm2/2^icos(0i - 02) 

2/ll2/M2COs(0i - 02) \m2l2+h 

-h (jm2/202 sin( 0 i - 02) -gsin 0 i(jmi +m2)) 
\m2l2 (/i 0 i^sin( 0 i - 02) -l-gsin02) +M2 


- Ml 


Here, 0i and 02 are the joint angles, nt] and m 2 are the 
masses of link 1 and link 2, respectively. l\, I 2 are the 
lengths of the links, g is the gravity constant, 7i and I 2 
are the moments of inertia of the links, and mi,M 2 are the 
torques applied at the joints. We chose mi — m 2 = 0.5 kg 
and h = h = 0.5 m. ui,U 2 were constrained to be in the 
range [—2,2] N m. To compute the forward dynamics, we 
solve this linear equation for the second derivative of the 
joint angles. The goal state is [O 0 0 O] which has the 
double pendulum standing up with no velocity. 

Eor the cost function, we used a = 0.05, Qp = 57, Qy = 
diag([0.04, 0.04, 0, 0]), P = 0.017, where 7 is the identity 
matrix. The upper left block of R, for the squashed controls, 
is 0.017. In order to make the system stabilize near the goal, 
we increased the control penalty to 0.1 when the system was 
near the goal. We chose T = 8,5= 0.08s, Vc = 16.7 Hz, = 
50 Hz. 

We can rewrite the dynamics in the form given in Eqn. 2: 


^^(q,q,q) = 


A= [01 02cos(0i — 02 ) 02 sin(0i —02) sin0i] 
B= [01 cos(0i — 02 ) 02 0fsin(0i —02) sin 02 ] 


A = 


llilmi 


-m2)+h 


\m 2 l 2 l 1 

jm2l2h 

-gli{\mi +m 2 ) 

\m 2 l 2 h 

\m2l2+h 

-\m 2 l 2 h 

-\m 2 l 2 g 

T = [UI,U 2 Y- 


^^(q,q,q) = 


p 0COS0 0^sin0 p 0 
0 0 0 0 pcos9 


A = [M + m, -ml, —-m/,5,3,21]T 
T = [m,— 3gsin0]''' 


0 

0 


3) Double Pendulum: The double pendulum system is a 
fully actuated two link system with applied torques at the 




















